Conversation

@djtodoro

@djtodoro djtodoro commented Jul 1, 2025

The GNU GCC toolchain already supports big-endian for the RISC-V target. That support was merged without a corresponding change to the psABI document.
Here [0] is the initial PR adding big-endian support to the LLVM project, so let's add the documentation part as well.

[0] llvm/llvm-project#146534

@rui314
Collaborator

rui314 commented Jul 1, 2025

I think we need to clarify which relocations write data in big-endian order when the output is big-endian. My understanding is as follows:

R_RISCV_{32,64}
R_RISCV_ADD{16,32,64}
R_RISCV_SUB{16,32,64}
R_RISCV_SET{16,32,64}
R_RISCV_32_PCREL
R_RISCV_PLT32
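
For illustration only, here is a minimal sketch of what "write data in big-endian order" could mean for the 32-bit data relocations in this list. The helper names `write_data32`, `read_data32` and `apply_add32` are hypothetical and are not taken from any linker; the read-modify-write in `apply_add32` assumes the usual ADD32 semantics of adding to the word already stored at the relocated location.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit data word at `loc` using the
 * byte order of the output file. */
void write_data32(uint8_t *loc, uint32_t value, int big_endian)
{
    assert(loc != NULL);
    for (unsigned i = 0; i < 4; i++) {
        /* Big-endian places the most significant byte first. */
        unsigned shift = big_endian ? 8u * (3u - i) : 8u * i;
        loc[i] = (uint8_t)(value >> shift);
    }
}

/* Hypothetical helper: read the word back using the same byte order. */
uint32_t read_data32(const uint8_t *loc, int big_endian)
{
    assert(loc != NULL);
    uint32_t value = 0;
    for (unsigned i = 0; i < 4; i++) {
        unsigned shift = big_endian ? 8u * (3u - i) : 8u * i;
        value |= (uint32_t)loc[i] << shift;
    }
    return value;
}

/* Sketch of an ADD32-style update: the existing word must be read and
 * written back in the output byte order, otherwise the sum is wrong. */
void apply_add32(uint8_t *loc, uint32_t addend, int big_endian)
{
    write_data32(loc, read_data32(loc, big_endian) + addend, big_endian);
}
```

The point of the sketch is that both the read and the write side of an ADD/SUB-style relocation depend on the output endianness, which is why it helps to list these relocations explicitly.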

@djtodoro
Author

djtodoro commented Jul 4, 2025

I think we need to clarify which relocations write data in big-endian order when the output is big-endian. My understanding is as follows:

R_RISCV_{32,64}
R_RISCV_ADD{16,32,64}
R_RISCV_SUB{16,32,64}
R_RISCV_SET{16,32,64}
R_RISCV_32_PCREL
R_RISCV_PLT32

Hi @rui314, I agree, thanks a lot for pointing that out.

@kito-cheng
Collaborator

@djtodoro do you have a plan to update the PR according to @rui314's comment?

@djtodoro
Author

@djtodoro do you have a plan to update the PR according to @rui314's comment?

Yes, sure.

@djtodoro djtodoro force-pushed the pr/riscv-be branch 2 times, most recently from 332088f to 97ccbce Compare August 12, 2025 08:20
NOTE: Big-endian calling conventions follow the same rules as little-endian
calling conventions. The only difference is in the byte ordering of multi-byte
values in memory and registers. Register usage, argument passing, and return
value conventions remain the same.
Collaborator

Does this address #265 (comment)?

Collaborator

This is a draft for that. We already pass 2×XLEN-bit scalars the way we expect:

Scalars that are 2×XLEN bits wide are passed in a pair of argument registers,
with the low-order XLEN bits in the lower-numbered register and the high-order
XLEN bits in the higher-numbered register.  If no argument registers are
available, the scalar is passed on the stack by value.  If exactly one
register is available, the low-order XLEN bits are passed in the register and
the high-order XLEN bits are passed on the stack.

So I tried to add a paragraph that clarifies this and gives an example, and added a NOTE describing the rationale.

The other thing I added covers variadic 2×XLEN-bit arguments and aggregates with XLEN < size <= 2×XLEN.

I haven't checked the GCC implementation yet, but IIRC this may not match GCC's default big-endian behavior.

diff --git a/riscv-cc.adoc b/riscv-cc.adoc
index 0768360..037b47f 100644
--- a/riscv-cc.adoc
+++ b/riscv-cc.adoc
@@ -191,6 +191,16 @@ available, the scalar is passed on the stack by value.  If exactly one
 register is available, the low-order XLEN bits are passed in the register and
 the high-order XLEN bits are passed on the stack.
 
+This register-pair ordering is defined in terms of value significance and is
+independent of endianness.  For example, on RV32BE a 64-bit scalar returned
+in a0/a1 places bits [31:0] (the least-significant XLEN bits) in a0 and
+bits [63:32] in a1; memory layout remains big-endian.
+
+NOTE: Defining the register-pair ordering independent of endianness allows
+RV32_Zdinx and Zilsd paired load/store paths to be used directly for argument
+passing and return without extra swaps.  Memory layout remains governed by the
+target endianness.
+
 Scalars wider than 2×XLEN bits are passed by reference and are replaced in the
 argument list with the address.
 
@@ -198,7 +208,10 @@ Aggregates whose total size is no more than XLEN bits are passed in
 a register, with the fields laid out as though they were passed in memory. If
 no register is available, the aggregate is passed on the stack.
 Aggregates whose total size is no more than 2×XLEN bits are passed in a pair
-of registers; if only one register is available, the first XLEN bits are passed
+of registers with the fields laid out as though they were passed in memory:
+the lower-numbered register holds the lower-addressed XLEN-sized chunk of
+the aggregate and the higher-numbered register holds the next chunk;
+if only one register is available, the first XLEN bits are passed
 in a register and the remaining bits are passed on the stack. If no registers are
 available, the aggregate is passed on the stack. Bits unused due to
 padding, and bits past the end of an aggregate whose size in bits is not
@@ -231,7 +244,10 @@ same manner as named arguments, with one exception.  Variadic arguments with
 even-numbered), or on the stack by value if none is available. After a
 variadic argument has been passed on the stack, all future arguments will also
 be passed on the stack (i.e. the last argument register may be left unused due
-to the aligned register pair rule).
+to the aligned register pair rule).  For 2×XLEN scalars placed in an aligned
+register pair, the lower-numbered register holds the least-significant XLEN bits
+and the higher-numbered register holds the most-significant XLEN bits,
+regardless of endianness.
 
 Values are returned in the same manner as a first named argument of the same
 type would be passed.  If such an argument would have been passed by
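
To make the difference between the two rules in this draft concrete, here is a minimal sketch for RV32. It is only a model of my reading of the text, not compiler code: `struct reg_pair`, `pass_scalar64`, `load_word` and `pass_aggregate8` are hypothetical names, and the aggregate case assumes that "laid out as though they were passed in memory" means each register receives one XLEN-sized load in the target byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an RV32 argument-register pair such as a0/a1. */
struct reg_pair {
    uint32_t lo_reg; /* lower-numbered register  */
    uint32_t hi_reg; /* higher-numbered register */
};

/* 2xXLEN scalar: split by value significance, independent of endianness. */
struct reg_pair pass_scalar64(uint64_t value)
{
    struct reg_pair p = { (uint32_t)value, (uint32_t)(value >> 32) };
    return p;
}

/* Load one 32-bit word from a byte image in the given byte order. */
uint32_t load_word(const uint8_t *mem, int big_endian)
{
    assert(mem != NULL);
    if (big_endian)
        return ((uint32_t)mem[0] << 24) | ((uint32_t)mem[1] << 16) |
               ((uint32_t)mem[2] << 8) | (uint32_t)mem[3];
    return ((uint32_t)mem[3] << 24) | ((uint32_t)mem[2] << 16) |
           ((uint32_t)mem[1] << 8) | (uint32_t)mem[0];
}

/* Aggregate of up to 2xXLEN bits: the lower-numbered register holds the
 * lower-addressed chunk, as though two XLEN-sized loads were issued. */
struct reg_pair pass_aggregate8(const uint8_t image[8], int big_endian)
{
    struct reg_pair p = { load_word(image, big_endian),
                          load_word(image + 4, big_endian) };
    return p;
}
```

Under this model, the scalar 0x1122334455667788 ends up with 0x55667788 in the lower-numbered register on both endiannesses, while an 8-byte aggregate whose big-endian memory image is 11 22 33 44 55 66 77 88 ends up with 0x11223344 in the lower-numbered register.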

@djtodoro
Author

djtodoro commented Sep 5, 2025

@kito-cheng Thanks, I agree!

I haven't checked the GCC implementation yet, but IIRC this may not match GCC's default big-endian behavior.

We will check the GCC implementation and fix it there.

@kito-cheng
Collaborator

@aswaterman could you take a look at the big-endian calling convention part? :)

@djtodoro
Author

@kito-cheng Thanks, I agree!

I haven't checked the GCC implementation yet, but IIRC this may not match GCC's default big-endian behavior.

We will check the GCC implementation and fix it there.

Okay. For this basic test case:

$ cat test.c
long long test()
{
  return 0x1;
}

GCC for LE generates:

$ riscv64-unknown-linux-gnu-gcc -c test.c -O2 -march=rv32gc -mabi=ilp32
$ riscv64-unknown-linux-gnu-objdump -d test.o

test.o:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <test>:
   0: 4505                 li a0,1
   2: 4581                 li a1,0
   4: 8082                 ret

And for BE, it generates:

$ riscv64-unknown-linux-gnu-gcc -c test.c -O2 -march=rv32gc -mabi=ilp32 -mbig-endian
$ riscv64-unknown-linux-gnu-objdump -d test.o

test.o:     file format elf32-bigriscv


Disassembly of section .text:

00000000 <test>:
   0: 4585                 li a1,1
   2: 4501                 li a0,0
   4: 8082                 ret

So GCC currently does not follow the proposal here. We came up with a fix (djtodoro/gcc@71a0f9f) that still needs some extra testing; with it applied, we now get:

$ riscv64-unknown-linux-gnu-gcc -c test.c -O2 -march=rv32gc -mabi=ilp32 -mbig-endian
$ riscv64-unknown-linux-gnu-objdump -d test.o

test.o:     file format elf32-bigriscv


Disassembly of section .text:

00000000 <test>:
   0: 4581                 li a1,0
   2: 4505                 li a0,1
   4: 8082                 ret

@aswaterman
Contributor

I haven't had time to think this through yet, but make sure whatever you propose does the right thing for variadic functions. In particular, you want the argument-register layout to match the memory layout of arguments passed on the stack. This might encourage you to stick with GCC's current implementation, rather than making the change that @djtodoro mentioned.

@kito-cheng kito-cheng mentioned this pull request Oct 8, 2025
@djtodoro
Author

@aswaterman @kito-cheng Thanks for your comments!

I checked variadic functions and found that the current psABI proposal text needs adjustment to match the actual GCC implementation after our fix (djtodoro/gcc@71a0f9f).

Here is a small example:

$ cat variadic.c 
#include <stdarg.h>
 
volatile unsigned int SN[2];
volatile unsigned int SV[2];
volatile unsigned int SR[2];
 
__attribute__((noinline))
void consume_named(unsigned long long x) {
  SN[0] = (unsigned)x;
  SN[1] = (unsigned)(x >> 32);
}
 
__attribute__((noinline))
void consume_var(const char *tag, ...) {
  va_list ap; va_start(ap, tag);
  unsigned long long x = va_arg(ap, unsigned long long);
  SV[0] = (unsigned)x;
  SV[1] = (unsigned)(x >> 32);
  va_end(ap);
}
 
__attribute__((noinline))
unsigned long long ret64(void) {
  return 0x1122334455667788ULL;
}
 
int main(void) {
  consume_named(0x1122334455667788ULL);
 
  consume_var("p", 0x1122334455667788ULL);
 
  unsigned long long r = ret64();
  SR[0] = (unsigned)r;
  SR[1] = (unsigned)(r >> 32);
 
  return 0;
}

Compile it as follows (the asm files are attached):

# this does not include our proposed fix: https://github.com/djtodoro/gcc/commit/71a0f9fc4bf9ff1b92ac434e362261ed16ff396b
$ riscv64-unknown-linux-gnu-gcc -S -O2 -march=rv32gc -mabi=ilp32 -mbig-endian variadic.c -o big_withoutfix_variadic.s
# with the fix
$ riscv64-unknown-linux-gnu-gcc -S -O2 -march=rv32gc -mabi=ilp32 -mbig-endian variadic.c -o big_afterfix_variadic.s
# LE
$ riscv64-unknown-linux-gnu-gcc -S -O2 -march=rv32gc -mabi=ilp32 variadic.c -o le_variadic.s

So consume_named and the return value are fine; that is what we implemented with the GCC fix. The variadic case, however, reveals an important distinction that needs to be documented. Let's look at the variadic argument in consume_var. The 64-bit value is passed after the fixed string argument:
- a0 = pointer to "p"
- a2/a3 = the 64-bit value (aligned pair)

In BE (both before and after our GCC fix):
- Memory: [0x11223344][0x55667788] (big-endian order)
- Register assignment: a2 = 0x11223344 (MSW), a3 = 0x55667788 (LSW)
- Stack after spilling: offset 24 = 0x11223344 (MSW), offset 28 = 0x55667788 (LSW)

This maintains big-endian memory layout on the stack, which is essential for va_arg to work correctly.

So the issue is: The current psABI proposal states that variadic 2×XLEN scalars should use "the lower-numbered register holds the least-significant XLEN bits... regardless of endianness."
But this would break va_arg functionality on BE systems.
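
A minimal sketch of why the significance-based ordering would break here, written as a plain C model rather than real prologue code. It assumes the callee spills the lower-numbered register to the lower address (as the offsets 24 and 28 above suggest) and that va_arg then reads the 64-bit value from that memory in big-endian order; `store_word_be`, `load_u64_be` and `spill_then_va_arg` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit register to memory in big-endian byte order,
 * modelling a word store on a big-endian target. */
void store_word_be(uint8_t *mem, uint32_t reg)
{
    assert(mem != NULL);
    mem[0] = (uint8_t)(reg >> 24);
    mem[1] = (uint8_t)(reg >> 16);
    mem[2] = (uint8_t)(reg >> 8);
    mem[3] = (uint8_t)reg;
}

/* Read a 64-bit value from big-endian memory, modelling how va_arg
 * would fetch it from the register save area. */
uint64_t load_u64_be(const uint8_t *mem)
{
    assert(mem != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < 8; i++)
        value = (value << 8) | mem[i];
    return value;
}

/* Model of the variadic callee: spill the aligned pair a2/a3 to two
 * consecutive slots (a2 at the lower address), then read it back. */
uint64_t spill_then_va_arg(uint32_t a2, uint32_t a3)
{
    uint8_t save_area[8];
    store_word_be(save_area, a2);
    store_word_be(save_area + 4, a3);
    return load_u64_be(save_area);
}
```

With the memory-layout ordering (a2 = 0x11223344, a3 = 0x55667788) the model returns the original 0x1122334455667788. With the significance-based ordering (a2 = 0x55667788, a3 = 0x11223344) it returns 0x5566778811223344, i.e. the two words come back swapped.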

For the psABI, I propose we clarify the distinction:

  • Named args/returns on BE: a0=LSW, a1=MSW (significance-based ordering - stays as is)
  • Variadic args on BE: Use memory-layout ordering to maintain stack consistency

So, the proposal could be:

   Variadic arguments with
   even-numbered), or on the stack by value if none is available. After a
   variadic argument has been passed on the stack, all future arguments will also
   be passed on the stack (i.e. the last argument register may be left unused due
  -to the aligned register pair rule).  For 2×XLEN scalars placed in an aligned
  -register pair, the lower-numbered register holds the least-significant XLEN bits
  -and the higher-numbered register holds the most-significant XLEN bits,
  -regardless of endianness.
  +to the aligned register pair rule).  For 2×XLEN variadic scalars placed in an 
  +aligned register pair, the register assignment follows memory layout ordering:
  +the lower-numbered register receives the XLEN bits from the lower memory address
  +and the higher-numbered register receives the XLEN bits from the higher memory
  +address. This ensures correct va_arg operation when arguments are spilled to stack.
  +
  +NOTE: This memory-layout ordering for variadic arguments differs from the 
  +significance-based ordering used for named arguments and return values on 
  +big-endian systems.

Please let me know your thoughts about this.

big_afterfix_variadic.s.txt
big_withoutfix_variadic.s.txt
le_variadic.s.txt

@aswaterman
Contributor

That sounds plausibly correct to me, but @kito-cheng should sanity-check it.

@aswaterman
Contributor

Also, make sure to run through the GCC test suite with this scheme. Your simple test appears to catch the interesting case, but the test suite covers much more ground.

@djtodoro
Author

@aswaterman Thanks!

Also, make sure to run through the GCC test suite with this scheme. Your simple test appears to catch the interesting case, but the test suite covers much more ground.

Of course, I agree :)

@djtodoro
Author

ping @kito-cheng :) any thoughts on this? :)
